The objective of this project was to compile data on the data science skills currently in demand. We did this by creating multiple tables containing skill names scraped from blog posts, then populating them with counts from actual job postings.
In doing so, we sought to answer three questions:
Our first main step was creating a table that populated data based on 1) skills in demand from data blogs and 2) job postings with the skills listed.
We used the rvest package with CSS node selectors to extract the skills of interest:
# Start by reading an HTML page with read_html():
skills_page <- read_html("https://brainstation.io/career-guides/data-science-resume-examples#what-skills-should-you-put-on-a-data-science-resume")
skills_set1 <- skills_page %>% html_nodes("ul:nth-child(4)") %>% html_elements("li") %>% html_text2()
skills_set2 <- skills_page %>% html_nodes("ul:nth-child(6)") %>% html_elements("li") %>% html_text2()
From here, we can preview the vector:
writeLines(skills_set1)
## Data Courses
## Data Analytics
## Data Science
## Python
## Cybersecurity
## Data analysis
## Data wrangling
## Data modeling
## Statistics
## Data visualization
## Programming
## Quantitative analysis
## Machine learning
## Machine learning models
## Data mining
## Debugging
## Hypothesis testing
## A/B tests
## Regression
## Research the company, the role, and relevant data skills
## Reference resume templates and samples to build a resume outline
## Add relevant education experience, work experience and data projects to the correct section of your resume
## Highlight experience with machine learning and data tools
## Craft concise bullet points using the action verb + task + result format for each experience, emphasizing data-driven successes
## Have a trusted peer proofread your Data Scientist resume for grammar and spelling to make sure your experience is professionally presented
## Work on projects in a collaborative setting
## Take advantage of our flexible plans and scholarships
## Get access to VIP events and workshops
The next step was subsetting the relevant entries and trimming extra whitespace from skills_set1 and skills_set2.
# Keep only the relevant entries (6 through 19) and trim surrounding whitespace
vector1 = c()
for (i in 6:19) {
  vector1[i-5] <- trimws(skills_set1[i], which = "both")
}
# Likewise for entries 3 through 34 of the second set
vector2 = c()
for (i in 3:34) {
  vector2[i-2] <- trimws(skills_set2[i], which = "both")
}
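Since trimws() is vectorized, the two loops above can also be collapsed into single calls (a sketch using the same index ranges as above):

```r
vector1 <- trimws(skills_set1[6:19])
vector2 <- trimws(skills_set2[3:34])
```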
Here are the cleaned vectors:
vector1 %>% writeLines()
## Data analysis
## Data wrangling
## Data modeling
## Statistics
## Data visualization
## Programming
## Quantitative analysis
## Machine learning
## Machine learning models
## Data mining
## Debugging
## Hypothesis testing
## A/B tests
## Regression
vector2 %>% writeLines()
## R
## Python
## C
## C++
## C#
## HTML
## Java
## JavaScript
## PHP
## SAS
## SQL
## Scala
## MATLAB
## SQL Server
## NoSQL
## Hadoop
## OpenRefine
## TensorFlow
## Cloudera
## Tableau
## Microsoft Excel
## Octave
## Spark
## PowerBI
## Plotly
## Bokeh
## Matplotlib
## Seaborn
## Keras
## Pytorch
## AWS
## Hive
Soft Skills: We can start by reading an HTML page with read_html():
softskills_page <- read_html("https://zety.com/blog/data-scientist-resume-example")
From here, we can scrape the soft skills using a node selector:
skills_set3 <- softskills_page %>% html_nodes("td:nth-child(1) > p > a") %>% html_text2()
skills_set3[3:9]
## [1] "Soft Skills" "Communication" "Collaboration"
## [4] "Creative Thinking" "Critical Thinking" "Problem Solving"
## [7] "Interpersonal Skills"
As before, we now have a vector of soft skills:
vector3 = c()
for (i in 3:9) {
  vector3[i-2] <- trimws(skills_set3[i], which = "both")
}
vector3
## [1] "Soft Skills" "Communication" "Collaboration"
## [4] "Creative Thinking" "Critical Thinking" "Problem Solving"
## [7] "Interpersonal Skills"
We can now combine the technical-skill vectors with the soft-skill vector.
title_vector <- c(vector1, vector2, vector3)
From there, we manually searched multiple job listing sites and populated tables based on the number of results:
tbl1.siteskills = data.frame(skill_title = title_vector,
indeed = indeed_jobs,
linkedin = linkedin_jobs,
simplyhired = simplyhired_jobs,
ziprecruiter = ziprecruiter_jobs
)
head(tbl1.siteskills) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| skill_title | indeed | linkedin | simplyhired | ziprecruiter |
|---|---|---|---|---|
| Data analysis | 46,915 | 580,822 | 27,586 | 1,685,359 |
| Data wrangling | 2,660 | 3,329 | 1,233 | 434 |
| Data modeling | 28,328 | 135,443 | 13,702 | 623,648 |
| Statistics | 2,327 | 216,875 | 44,456 | 288,970 |
| Data visualization | 41,727 | 119,686 | 37,409 | 170,268 |
| Programming | 9,444 | 510,656 | 87,187 | 2,047,037 |
We'll refer to this as Table 1 moving forward. We can export this table to the working directory:
# Get working directory path
path <- getwd()
path
## [1] "/Users/deepikadilip/git/Data_607_Project3"
write.csv(tbl1.siteskills, file.path(path, "skills_cleaned_data.csv"))
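One caveat: by default, write.csv() prepends a row-number column to the file. If the CSV should contain only the table's own columns, row.names = FALSE suppresses it:

```r
write.csv(tbl1.siteskills, file.path(path, "skills_cleaned_data.csv"), row.names = FALSE)
```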
From here, we can join this table with existing data we've scraped:
tbl2.skillcat = read.csv('SkillCategories.csv')
head(tbl2.skillcat) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| skill_title | SKILLID | skill_category |
|---|---|---|
| Data analysis | 1 | Fundamentals |
| Data wrangling | 2 | Fundamentals |
| Data modeling | 3 | Fundamentals |
| Statistics | 4 | Mathematics |
| Data visualization | 5 | Fundamentals |
| Programming | 6 | Fundamentals |
tbl3.skillrec = read.csv('RecomendedSkills.csv')
head(tbl3.skillrec) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| Skill | RecommendationID | InJobSkills |
|---|---|---|
| C++ | 1 | C++ |
| Collaboration | 1 | Collaboration |
| Collaboration | 7 | Collaboration |
| Communication | 1 | Communication |
| Communication | 2 | Communication |
| Communication | 3 | Communication |
tbl4.skillsites = read.csv('RecommendationSites.csv')
head(tbl4.skillsites, 2) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| RecommendationID | URL | Site | Title | Year |
|---|---|---|---|---|
| 1 | https://www.thebalancecareers.com/list-of-data-scientist-skills-2062381 | the balance careers | Important Job Skills for Data Scientist | 2021 |
| 2 | https://www.analyticsvidhya.com/blog/2020/11/14-must-have-skills-to-become-a-data-scientist-with-resources/ | Analytics Vidhya | 14 Must-Have Skills to Become a Data Scientist (with Resources!) | 2020 |
We can start visualization by looking at the job demand distribution in Table 1. To do this, we'll need to reshape the dataset to long format prior to using ggplot:
mlt.tbl1 = pivot_longer(tbl1.siteskills, cols = c("indeed", "linkedin", "simplyhired", "ziprecruiter"), values_to = "count")
mlt.tbl1 = mlt.tbl1 %>% mutate(count = as.integer(gsub("\\,", "", count)))
mlt.tbl1.merged = mlt.tbl1 %>% merge(tbl2.skillcat, by = c("skill_title", "SKILLID"), all.x = T)
mlt.tbl1.merged$name = str_to_sentence(mlt.tbl1.merged$name)
plt.sites = ggplot(mlt.tbl1.merged, aes(skill_title, count, fill = factor(skill_category))) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 1, scales = "free_y") +
  theme_minimal() +
  scale_fill_economist() +
  labs(x = "Skill", y = "Count", fill = "Skill Category") +
  theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1), legend.position = "bottom")
ggplotly(plt.sites) #Makes it interactive
This plot is a bit dense, so we can group the skill categories into broader "buckets":
mlt.tbl1.merged$skill_bucket = ifelse(mlt.tbl1.merged$skill_category == "Languages", "Languages", ifelse(mlt.tbl1.merged$skill_category == "Soft Skills", "Soft Skills", "Tech Data"))
plt.sites.aggregated = ggplot(mlt.tbl1.merged, aes(skill_title, count, fill = factor(skill_category))) +
  geom_bar(stat = "identity") +
  facet_wrap(~ skill_bucket + name, ncol = 2, scales = "free") +
  theme_minimal() +
  scale_fill_economist() +
  labs(x = "Skill", y = "Count", fill = "Skill Category") +
  theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1), legend.position = "bottom")
ggplotly(plt.sites.aggregated, width = 800, height = 1500)
If we don't want to stratify by job website and visualize an overview instead, it will appear as the following:
tbl1.per_bucket = mlt.tbl1.merged %>% group_by(skill_title, skill_bucket) %>% summarise(count = sum(count))
plt.bucket = ggplot(tbl1.per_bucket, aes(skill_title, count, fill = factor(skill_bucket))) +
  geom_bar(stat = "identity") +
  facet_wrap(~ skill_bucket, ncol = 1, scales = "free") +
  theme_minimal() +
  scale_fill_economist() +
  labs(x = "Skill", y = "Count", fill = "Skill Category") +
  theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust = 1), legend.position = "bottom")
ggplotly(plt.bucket, width = 800, height = 1500)
Finally, we can run some basic statistical tests to see whether particular skills are listed more often on one website than the others. We'll start by running a proportions test and identifying which skills differ the most across websites:
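The prop.table data frame summarized below isn't built in the code shown; here is a minimal sketch of one way to compute per-skill site proportions together with a chi-squared test statistic (the column names are assumptions based on the table that follows):

```r
library(dplyr)
library(tidyr)

# For each skill, convert site counts to proportions and test whether
# listings are spread evenly across sites (goodness-of-fit chi-squared).
prop.table <- mlt.tbl1 %>%
  group_by(skill_title) %>%
  mutate(prop = count / sum(count),
         test.statistic = unname(chisq.test(count)$statistic)) %>%
  ungroup() %>%
  select(skill_title, name, prop, test.statistic) %>%
  pivot_wider(names_from = name, values_from = prop)
```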
| | indeed | linkedin | simplyhired | ziprecruiter | In.Recommended.Skills. | test.statistic |
|---|---|---|---|---|---|---|
| Communication | 0.0060592 | 0.4471799 | 0.0040841 | 0.5426765 | 4.0e-07 | 18815965 |
| Hypothesis testing | 0.0016622 | 0.0015483 | 0.0010033 | 0.9957858 | 4.0e-07 | 11030788 |
| Programming | 0.0035579 | 0.1923851 | 0.0328469 | 0.7712029 | 7.2e-06 | 7180952 |
| Data analysis | 0.0200433 | 0.2481416 | 0.0117854 | 0.7200272 | 2.6e-06 | 5567239 |
| Problem Solving | 0.3125529 | 0.5271250 | 0.1508633 | 0.0094539 | 5.0e-06 | 4506362 |
We'll do the same for those that differ the least among websites:
prop.table %>% arrange(desc(test.statistic)) %>% tail(5) %>% kable() %>% kable_styling(bootstrap_options = "basic")
| | indeed | linkedin | simplyhired | ziprecruiter | In.Recommended.Skills. | test.statistic |
|---|---|---|---|---|---|---|
| Plotly | 0.1740139 | 0.2114683 | 0.1143520 | 0.4998343 | 0.0003315 | 2600.47 |
| Octave | 0.1479140 | 0.7926675 | 0.0417193 | 0.0164349 | 0.0012642 | 2235.63 |
| Seaborn | 0.2855805 | 0.3511236 | 0.2050562 | 0.1573034 | 0.0009363 | 478.18 |
| Bokeh | 0.2816667 | 0.3000000 | 0.1600000 | 0.2566667 | 0.0016667 | 228.06 |
| OpenRefine | 0.3523810 | 0.1809524 | 0.3238095 | 0.1333333 | 0.0095238 | 52.26 |
Seaborn and Plotly had similar distributions across the job sites, but Matplotlib appeared more often on ZipRecruiter. Next, to categorize the skills from the recommendation websites, we join Table 3 and Table 2 on the common skill name. Some skill names are capitalized inconsistently, which would cause them to be counted separately, so we converted the skill-name columns in both tables to lower case to avoid this issue.
#Change the skills column to all lower case
tbl3.skillrec$Skill <- tolower(tbl3.skillrec$Skill)
tbl2.skillcat$skill_title <- tolower(tbl2.skillcat$skill_title)
# Join on common skill names (inner join keeps only skills present in both tables)
skillcat_rec <- tbl3.skillrec %>% inner_join(tbl2.skillcat, by = c( "Skill" = "skill_title"))
rec <- skillcat_rec %>% group_by(RecommendationID) %>% count(skill_category)
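Because inner_join() drops recommended skills that have no match in the category table, it's worth checking what gets discarded before settling on the join type (a sketch using dplyr::anti_join):

```r
# Skills recommended by articles that have no category assigned
tbl3.skillrec %>% anti_join(tbl2.skillcat, by = c("Skill" = "skill_title"))
```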
head(rec) %>% kable() %>% kableExtra::kable_styling(bootstrap_options = "basic")
| RecommendationID | skill_category | n |
|---|---|---|
| 1 | Database Management | 4 |
| 1 | Fundamentals | 7 |
| 1 | Languages | 7 |
| 1 | Mathematics | 2 |
| 1 | Soft Skills | 4 |
| 1 | Software | 1 |
Next, we graph the distribution of skill categories among the 12 chosen websites.
# WIP: plan to make this interactive as well; the colors are hard to distinguish
ggplot(rec, aes(skill_category, n, fill = skill_category)) +
  geom_col() +
  facet_grid(rows = vars(RecommendationID), scales = "free_y", switch = "y", space = "free_y") +
  coord_flip() +
  labs(x = "RecommendationID", y = "Count") +
  theme(axis.text.x = element_text(size = 8, hjust = 1, vjust = 1),
        axis.text.y = element_blank(), axis.ticks.y = element_blank())
(WIP: Use this to compare recommended skill categories vs. job postings, question #2. Fundamentals seems to be a more popular category here than in the job site postings. Needs annotations.)
(Maybe we can get a more detailed look at the individual skills within the most popular categories to support further analysis. I started a little of it to see if it's worth investigating.)
skillcat_rec %>% filter(RecommendationID == 1 | RecommendationID == 9, skill_category == "Fundamentals")
## Skill RecommendationID InJobSkills SKILLID
## 1 data analysis 1 Data analysis 1
## 2 data mining 1 Data mining 10
## 3 data modeling 1 Data modeling 3
## 4 data visualization 1 Data visualization 5
## 5 data visualization 9 Data visualization 5
## 6 data wrangling 1 Data wrangling 2
## 7 data wrangling 1 Data wrangling 2
## 8 data wrangling 9 Data wrangling 2
## 9 machine learning 1 Machine learning 8
## 10 machine learning 9 Machine learning 8
## skill_category
## 1 Fundamentals
## 2 Fundamentals
## 3 Fundamentals
## 4 Fundamentals
## 5 Fundamentals
## 6 Fundamentals
## 7 Fundamentals
## 8 Fundamentals
## 9 Fundamentals
## 10 Fundamentals
# Data wrangling appears twice for RecommendationID 1, is this an error?
# WIP Need to add color, maybe a different graphical output type. Trying to see common skills between articles
# Fundamentals
skillcat_rec %>%
  # Note: & binds tighter than | in R, so the original filter only applied the
  # category condition to ID 1; the parentheses make the intended logic explicit
  filter((RecommendationID == 1 | RecommendationID == 9) & skill_category == "Fundamentals") %>%
  ggplot(aes(RecommendationID, Skill)) + geom_count() + facet_grid(cols = vars(RecommendationID)) + scale_x_continuous(breaks = c(1, 9))
# WIP: categorize skills into the three general buckets to potentially answer #3
First, we look at the skills that appear both in the job postings and in the articles/blogs with recommended skills:
color = c("#008080", "#428bca", "#d2afff", "#dab600", "#979aaa")
mlt.tbl1.merged %>%
#change the blanks to NA
mutate(In.Recommended.Skills.=
ifelse(In.Recommended.Skills.=="",
NA, In.Recommended.Skills.)) %>%
mutate(skill_category =
ifelse(skill_category =="Software" |
skill_category =="Database Management" |
skill_category == "Fundamentals" |
skill_category == "Mathematics",
"Fundamentals", skill_category)) %>%
#filter out NAs
#now have job posting skills that were also in recommendations
filter(!is.na(In.Recommended.Skills.)) %>%
treemap(index=c("skill_category","skill_title"),
vSize = "count",
type="index",
fontsize.labels=c(15,12),
fontcolor.labels=c("white","blue"),
fontface.labels=c(2,1),
bg.labels=c("transparent"),
align.labels=list(
c("center", "top"),
c("right", "bottom")),
overlap.labels=0.5,
inflate.labels=F,
border.col=c("black","white"),
border.lwds=c(7,2),
palette = color,
title="Skills by Category in Both",
fontsize.title=10
)
## Warning in if (class(try(col2rgb(bg.labels), silent = TRUE)) == "try-error")
## stop("Invalid bg.labels"): the condition has length > 1 and only the first
## element will be used
Next, we look at those skills that are in the Job Postings BUT NOT in the articles/blogs with recommended skills
color2 = c("#428bca","#d2afff","#dab600","#979aaa","#008080")
#which category has skills in job postings and in recomm
mlt.tbl1.merged %>%
#change the blanks to NA
mutate(In.Recommended.Skills.=
ifelse(In.Recommended.Skills.=="",
NA, In.Recommended.Skills.)) %>%
mutate(skill_category =
ifelse(skill_category =="Software" |
skill_category =="Database Management" |
skill_category == "Fundamentals" |
skill_category == "Mathematics" |
skill_category =="Data Warehouse" |
skill_category =="Visualization",
"Fundamentals", skill_category)) %>%
mutate(skill_category =
ifelse(skill_category =="Python Packages",
"Languages", skill_category)) %>%
#filter to keep only NAs
#now have job posting skills that were NOT in recommendations
filter(is.na(In.Recommended.Skills.)) %>%
treemap(index=c("skill_category","skill_title"),
vSize = "count",
type="index",
fontsize.labels=c(15,12),
fontcolor.labels=c("white","blue"),
fontface.labels=c(2,1),
bg.labels=c("transparent"),
align.labels=list(
c("center", "top"),
c("right", "bottom")),
overlap.labels=0.5,
inflate.labels=F,
border.col=c("black","white"),
border.lwds=c(7,2),
palette = color2,
title="Skills by Category in Posting only",
fontsize.title=10
)
## Warning in if (class(try(col2rgb(bg.labels), silent = TRUE)) == "try-error")
## stop("Invalid bg.labels"): the condition has length > 1 and only the first
## element will be used
There was a good amount of overlap in the skills for Data Science recommended by articles and the job postings:
The largest group in this case is the Soft Skills, specifically, Communication, Problem Solving, Collaboration and Critical Thinking. After that came Fundamentals, with skills like Programming, Data Analysis and Excel (the top 3 in this category). The smallest group here was Languages, with SQL, Python, Java and C++ as the top 4.
When we look at the skills the articles MISSED recommending, we see that Fundamentals is just slightly bigger than Soft Skills, and Languages is just a bit smaller. Top skills by categories that were MISSED by the articles are:
Fundamentals: Hypothesis Testing
Soft Skills: Interpersonal Skills
Languages: JavaScript, HTML, C# and C